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A minimax estimator has the minimum possible error (“risk”) in the worst case. We construct the 
first minimax estimators for quantum state tomography with relative entropy risk. The minimax 
risk of non-adaptive tomography scales as 0(l/%/]V), in contrast to that of classical probability 
estimation which is 0(1/N). We trace this deficiency to sampling mismatch : future observations 
that determine risk may come from a different sample space than the past data that determine the 
estimate. This makes minimax estimators very biased, and we propose a computationally tractable 
alternative with similar behavior in the worst case, but superior accuracy on most states. 


Quantum information processing relies on physical 
qubits that store and process quantum information. Test¬ 
ing and characterizing qubit devices is the business of 
quantum tomography [1], and quantum state tomography 
in particular is used to estimate the quantum state (den¬ 
sity matrix) p produced by an initialization procedure. 
Tomography comprises two steps: (1) data gathering , ac¬ 
complished by measuring a “quorum” of different observ¬ 
ables on N samples of p; and (2) an estimator that maps 
the data to a final estimate p. The goal, of course, is an 
accurate estimate - we want p to be “close” to the true 
state p , minimizing some error metric d(p : p). 

One might thus expect that tomographers would 
choose an estimator that is optimal (or at least near- 
optimal) in accuracy. Somewhat surprisingly, this is 
not done. Although several estimators are known 
and used (linear inversion [2], maximum likelihood [3], 
Bayesian mean [4], hedged maximum likelihood [5], L\- 
regularization [6]), none of them is known to have op¬ 
timal pointwise accuracy [22] for finite N. In fact, we 
don’t even know the ultimate bounds on accuracy, which 
makes it impossible to say which of these estimators (if 
any) are “good enough”. 

We remedy this embarrassing situation in the present 
Letter by constructing minimax estimators (depicted in 
Fig. 1; see detailed explanation after Eq. 7) with 
absolutely optimal performance. These estimators are 
unwieldy, but (i) their performance yields tight upper 
bounds on accuracy, effectively delineating what “good 
enough” means, and (ii) their construction provides quite 
a lot of insight into the structure of the problem. Armed 
with these results, we show that hedged maximum like¬ 
lihood (HML) is remarkably close to optimal, and out¬ 
performs minimax for most states (though of course its 
worst-case risk is higher). We also identify a good value 
for the hedging parameter (3 that appears in HML. 
Prerequisites: Defining “optimal” requires making sev¬ 
eral choices. For example, an optimal estimator for one 
error metric d(p : p) is generally not optimal for a differ¬ 
ent metric d'(p : p). Here [7], we quantify inaccuracy by 



(b) Minimax Estimators 





FIG. 1: Estimators for Pauli measurements on a rebit, 

depicted as distortions of the “linear inversion grid” (see text 
after Eq. 7). (a) Three standard estimators, each for M = 8 
measurements of X and Y. Each vertex of the red grid corre¬ 
sponds to an estimated density matrix. Linear inversion esti¬ 
mates may he outside “Bloch disk” of physical states. MLE 
estimates are non-negative, while HML yields strictly positive 
estimates, (b) Minimax estimators for M = 8,16, 32, 64 mea¬ 
surements of A' and Y on a rebit. They are locally biased, 
toward support points of the least favorable prior. 
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the quantum relative entropy , 

d(p :p)=D(p\\p) = Tr [p(log p - log p)\ . (1) 

Like its classical analogue, quantum relative entropy 
[21] is a well-motivated measure of predictive (and 
information-theoretic) inaccuracy [4]. It quantifies the 
expected cost, resulting from an imperfect estimate, of 
imperfectly predicting measurements of p’s diagonal ba¬ 
sis (because this is the hardest measurement to predict 
accurately). 

An estimator’s pointwise risk is a function of the true 
state p and is given by the average of d{p : p) over all 
possible data sets D: 

d(p) = J2Pr(D\p)d(p:pm- ( 2 ) 

D 

In the minimax paradigm, we quantify an estimator’s 
accuracy by its worst-case risk, d max = ma x p d(p). The 
minimax risk of the estimation problem is the minimum 
achievable risk (minimized over all possible estimators), 
and a minimax estimator is one that achieves this bound. 

In most inference problems, the sample space of pos¬ 
sible observations (data) is fixed by the problem. Not so 
in quantum tomography. Quantum systems can be mea¬ 
sured in many different and incomparable ways. This is 
the single most significant difference between quantum 
and classical estimation. This freedom is often removed 
in quantum problems by choosing the best or worst possi¬ 
ble measurement (e.g., as in the definition of quantum rel¬ 
ative entropy as the classical relative entropy of the most 
difficult-to-predict measurement). This is usually not 
done in tomography, because the optimal measurements 
are far too difficult. In this letter, we follow the majority 
of experiments and analyze tomography based on Pauli 
measurements on a single qubit. However, we also prove 
analytic lower bounds on minimax risk that apply to any 
non-adaptive measurement and any d-dimensional quan¬ 
tum system. In some parts of our analysis, we use a rebit 
- a quantum system with a 2-dimensional real Hilbert 
space, whose state space corresponds to the equatorial 
plane of the Bloch sphere - as an easier-to-analyze proxy 
for a qubit. 

Minimax risk: The first main result of this Letter 
is a lower bound on the asymptotic (N —> oo) minimax 
relative entropy risk of Pauli tomography on qubits and 
rebits, 


d ma x> 4 ’ ( 3 ) 

where D = 2 for rebits and D = 3 for qubits. Its 
0(1/viV) scaling contrasts sharply with the minimax 
risk of estimating a classical bit, which is almost exactly 
0.5/IV [10, 11]. We derive this bound below by mapping 
the minimax risk of qubit and rebit state tomography to 
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FIG. 2: Numerical minimax risk for qubits, rebits, 
and noisy coins. Black curves show the risk of numerically 
constructed minimax estimators for (a) a qubit and (b) a 
rebit, as a function of the number of samples (IV), up to the 
maximum that was numerically feasible. Red curves illustrate 
the numerically-computed risk of “noisy coin” systems whose 
noise levels are chosen to match the effective “noise” of the 
qubit and the rebit (respectively). Blue lines show the the 
lower bound given in Eq. (6). 


a classical “noisy coin” model. In Figure 2, we compare 
these bounds to numerical calculations of the minimax 
risk, for small N, of qubits, rebits, and noisy coins. 

A d-dimensional quantum state is analogous in many 
ways to a classical d-outcome probability distribution. 
However, its minimax risk scales differently because of a 
phenomenon instrinsic to quantum tomography (though 
not uniquely quantum) that we call sampling mismatch: 
the sample space for the observed events is neither unique 
nor isomorphic to the underlying state space. For ex¬ 
ample, the possible statistics for the three 2-outcome 
Pauli measurements on a qubit naturally define a cube, 
whereas the possible quantum states form a sphere (the 
Bloch ball). 
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Sampling mismatch can be reproduced in a simple clas¬ 
sical model called the “noisy coin” [12]. It is a classical 
system with a 2-outcome sample space (i.e., a coin flip) 
where each observation is erroneous with known proba¬ 
bility a. Sampling mismatch arises when we attempt to 
assign probabilities to future noiseless observations using 
data from noisy measurements. The noisy coin’s min¬ 
imax risk is 0(\/\fN), because nearly-pure states are 
hard to estimate accurately from noisy statistics. The 
corresponding minimax estimators are strongly biased 
toward nearly-pure states (see [12] for details). We are 
going to use a variant of the noisy coin model to bound 
the risk of tomography. 

We define “tomography” thus: N samples (copies) of 
a single-qubit state p will be prepared; each sample will 
be measured independently (not jointly together with 
other samples) in a predefined fashion (not adaptively). 
The fcth sample is measured in an arbitrary basis, and 
this measurement can be described by a POVM (posi¬ 
tive operator-valued measure) AIfe = {II fc ,ll —II fc } whose 
outcomes have probabilities {q, 1 — q} with q = Trll^p. 
Based on the N measurement results, we report a state 
p, and seek to minimize relative entropy cost. 

Now, suppose that before analyzing the data (but af¬ 
ter choosing the measurements!) we are told the eigen- 
basis of p. This helps us (only p's spectrum must be 
estimated), so the risk of spectrum estimation is a strict 
lower bound on the risk of full tomography [23]. 

We define {|0), 11)} to be the eigenstates of p , and 
write 


P = P |0)<0| + (1 — p) 11)(11 ■ (4) 

Now, we need only estimate p G [0,1]. This parame¬ 
ter manifold is identical to that of a coin. Furthermore, 
the quantum relative entropy between two diagonal den¬ 
sity matrices is identical to the classical relative entropy 
between the corresponding distributions. So, since p's 
eigenbasis is known, estimating p is identical to estimat¬ 
ing the bias of a coin. However, unless the eigenbases of 
p and the happen to coincide, the measurement data 
obtained from the N samples of p are not “noiseless”. 
Even if p = 0 (i.e., p is pure), the data remain somewhat 
random. The probability of observing is not p, but 

q = p (0| H fc |0> + (1 — p) (1| n* |1> 

= p( 1 - 2a k ) + a k 


where the effective noise in sample k is 

a fe = (l|n fc |l) 2 . (5) 


problem’s minimax risk by 

— e~ s 1 

dmax 7= 7= j (6) 

2 

where (3 is the average resolution provided by the N noisy 
samples: 


P = 


1 

N 


N 


5 > 


1 (1 — 2 a k ) 2 

N a fc(l ~ «fc)' 


(7) 


For any fixed measurement strategy - e.g., the stan¬ 
dard one where IV/3 samples are measured in the X , Y, Z 
bases - the maximum risk occurs when we choose the 
eigenbasis of p to maximize (3 in Eq. 7. This “least fa¬ 
vorable” basis is the one that lies as far as possible from 
all measured bases. For a rebit, it lies halfway between 
the X and Z bases, and a k = ^(1 — l/y/2). For a qubit, 
it is the geometric mean of the X , Y, and Z bases, and 
a k = |(1 — l/y/3). Inserting these values for a k yields 
the final bound given in Eq. 3. 

This argument applies (qualitatively) to tomogra¬ 
phy on any finite-dimensional system with any discrete 
POVM. As long as no samples are measured in a basis 
that diagonalizes p, the minimax risk scales as 0(l/\/N) 
(although the prefactor will vary). However, if any non¬ 
vanishing fraction of the N samples are measured in a 
basis that diagonalizes p, then Eq. 6 no longer applies. 
Thus, continuous POVMs such as the unitarily invari¬ 
ant Haar-uniform rank-1 POVM (a.k.a. the uniform 
POVM), require a slightly different argument. In the 
appendix, we prove that even in this case, the minimax 
risk is lower bounded by O ((IV log V) -1 / 2 ). 

Estimators: To confirm the bound given by Eq. 3 
and explore minimax risk at small N, we use numerics to 
find minimax estimators. An estimator is a map from the 
set of all possible datasets into the set of density matri¬ 
ces. The outcomes of the measurement (s) performed are 
represented by a set of positive operators {E k }, and the 
data themselves by a set of frequencies D = {n k }. For 
qubit Pauli tomography, the data comprise M = N/ 3 
samples each of a x , a y , and a z measurements; for rebits, 
they comprise M = N/ 2 samples each of cr x and a y mea¬ 
surements. 

We used numerical optimization (over the set of pos¬ 
sible estimators) to find minimax estimators. The algo¬ 
rithms are described in the appendix. In Figure 1, we de¬ 
pict the resulting estimators, and compare them to three 
canonical estimators: 


1. Linear inversion (p LI ): The first tomographic es¬ 
timator, it is obtained by equating each probability 
Pr(fc|/3 L i) = Tr E k p L1 to its observed frequency jff. 


We can model this situation perfectly by a noisy coin (as 
in Ref. [12]) where each observation fails with a differ¬ 
ent error probability. The error probability for the Mh 
sample is a k . In the appendix, we bound this estimation 


2. Maximum likelihood (p ML ): MLE assigns the 
density matrix that maximizes the probability 
of the observed data (the likelihood), C(p) = 
Pr(D\p) = ll k [Tr(E k p )"*]. 
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3. Hedged maximum likelihood (phml,/?): The 
HML estimator maximizes the product of C(p) and 
a “hedging function” h{p) = det(p)' 3 . This function 
is strictly convex and vanishes for rank-deficient 
states, so the HML estimate is always full-rank. 

To simplify visualization, we depict rebit estimators, 
which are qualitatively similar to qubit estimators and 
easier to depict. A rebit estimator is a map from datasets 
to Bloch vectors, as p : {0,..., M} 2 — y R 2 . We use the 
linear inversion estimator as a reference. As a linear map 
from the 2-dinrensional space of datasets ({0 ... M} 2 ) 
and the 2-dimensional space of rebit states (the unit disc 
in R 2 ), the linear inversion estimator is represented by 
a uniform grid on the “Bloch square” (Fig. la). Every 
other estimator is represented as a distortion of this grid. 
The vertices of the grid are estimates p, and the position 
of such a vertex within the grid indicates what dataset it 
came from. 

Minimax estimators for N = 16, 32, 64 and 128 (total) 
Pauli measurements on a rebit are shown in Figure lb. 
The most striking feature of these estimators is a pro¬ 
nounced “ripple” phenomenon. This is not a numerical 
artifact. Instead, it represents a consistent bias toward 
certain discrete points within the state space (support 
points of the least favorable prior - see Fig. 4 in the ap¬ 
pendix), which can be identified in Figure 1 as regions 
where the grid lines cluster together. The minimax es¬ 
timator demonstrates this bias because these points are, 
in a particular sense, the most difficult to estimate accu¬ 
rately. 

Improving on Minimax: The nrinimax criterion is 
an elegant concept, but a dangerous one. In its single- 
minded quest to improve the maximum risk, it has no 
concern for the pointwise risk at states that are “eas¬ 
ier” to estimate. In such regions, it may incur extreme 
bias and inaccuracy, for the sole purpose of achieving a 
tiny reduction in the maximum risk. For quantum to¬ 
mography, this effect become extreme. While 0(1/N) 
risk can be achieved on all full-rank states, the risk is 
unavoidably 0(1/y/N) near the boundary. Our numer¬ 
ical experiments confirm that the nrinimax estimator’s 
pointwise risk is 0(1/y/N) everywhere, whereas other es¬ 
timators easily achieve 0(1/N) risk in the interior of the 
Bloch sphere (Fig. 3b). If p really was selected adversar- 
ially, then minimax would be a wise strategy. But in re¬ 
alistic cases, we would prefer an estimator that achieved 
0(1/N) scaling where possible, even at the cost of slightly 
worse worst-case behavior. 

A good estimator should achieve 0(1/N) risk in the 
interior, while coming as close as possible to minimax 
performance near the boundary. The maximum likeli¬ 
hood estimator (MLE) is disqualified because its point- 
wise expected risk is uniformly infinite (it has nonzero 
probability of returning a rank-deficient estimate for ev¬ 
ery p 1 so d(p) = oo). However, hedged maximum likeli¬ 




FIG. 3: Maximum and pointwise risk of minimax 
and HML estimators. Plot (a) shows the maximum risk, 
for qubit tomography, of the minimax estimator and three 
different HML estimators (i3 = 0.01,0.04,0.10) for N < 192 
samples distributed equally among the 3 Pauli bases. Plot 
(b) shows the pointwise risk, along the axis oriented at 45 
degrees to both X and Y, of the same estimators for N = 128 
samples for a rebit (this minimax estimator is depicted in 
Fig. lb). The two local maxima of d(p) are at r = 1 and 
r m 1 — Jjj . Choosing ~ 0.04 balances these risks, and 
is therefore minimax among HML estimators. This optimal 
HML estimator comes very close to matching the worst-case 
performance of the minimax estimator, and outperforms it 
dramatically in the interior of the state space. 

hood (HML) does not have this behavior. Introduced in 
Ref. [5] as a full-rank alternative to MLE, HML general¬ 
izes classical “add-/3” estimators. Like them, it never as¬ 
signs zero probabilities, and has a adjustable parameter /3 
that governs how much it avoids zero eigenvalues. Clas¬ 
sical “add-/I” estimators are very nearly nrinimax (for 
fd ~ 1/2), which suggests that HML estimators might 
have similar near-optimality properties. 

All HML estimators have good behavior (0(1/N) 
pointwise risk) in the interior, so we are free to define the 
“optimal” /? by minimax (among HML estimators). As 
illustrated in Fig. 3b, an HML estimator’s pointwise risk 
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has local maxima at the boundary (pure states) and/or 
at a slightly depolarized state (with purity ~ 1 — 1/y/N). 
To minimize its maximum, we choose ft to equalize the 
risk at these two local maxima. The asymptotically op¬ 
timal ft for the noisy coin model was shown in Ref. [12] 
to be /3optimai ~ 0.0389, and our numerics confirm that 
ft ~ 0.04 is optimal to within the available numerical pre¬ 
cision for rebit tomography as well (Fig. 3b; qubit results 
for smaller N are not shown, but confirm that ft ss 0.04 
has nearly-minimax performance). 

For this near-optimal value of ft, HML compares fa¬ 
vorably with mininrax estimators. Its worst-case risk is 
very close to the mininrax risk (Fig. 3a), and it dramat¬ 
ically outperforms minimax in the interior of the state 
space (Fig. 3b). So while optimal hedging estimators do 
not offer strictly optimal performance by any criterion, 
they are (i) easy to specify and calculate, (ii) close to 
mininrax, and (ii) more accurate than minimax estima¬ 
tors for almost all states p. We do not why the opti¬ 
mal ft is so different for noiseless coins (ss 0.5) and for 
qubits/rebits/noisy coins (« 0.04), but it suggests funda¬ 
mental differences between noiselessly sampled systems 
and those (like qubits and noisy coins) where sampling 
mismatch is important. 
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tion, for the U.S. Department of Energy’s National Nu¬ 
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The minimax risk of a noisy coin 


In this appendix we show that the minimax risk of estimating the bias of a noisy coin is 0(l/y/N) (in contrast to 
the 0(1/N) minimax risk for a noiselessly observed coin), and derive a simple lower bound on it. 

Now, suppose a coin with bias p = Pr(“heads") is flipped N times and a sequence of binary outcomes n = {n k } are 
recorded. But these observations are unreliable; each outcome is recorded incorrectly with trial-dependent probabilities 
a. = {cifc} (all taken from the interval [0,1)). The distribution of the outcomes is 


N 


N 


Pr(n|p,a) = Pr(n k \p,a k ) = JJ q k k ( 1 - q k ) 1 nk , 

k =1 k =1 

where the probability of observing “heads” on trial k is not p, but 

q k (p) = u k +p( 1 - 2a fc ) =p + a k (l-2p). 


( 8 ) 


(9) 


We recover a standard noiseless coin when a k = 0 for all k. 

For each prior distribution p(p), the estimator with the smallest risk (expected cost) is the Bayes estimator for 
p(p), and its risk is the Bayes risk of p(p). Bayes estimators need not be simple, but because relative entropy is a 
Bregman divergence , the Bayes estimator is always the mean of the posterior distribution (obtained via Bayes’ Rule) 
[4]. The prior with the highest Bayes risk is the least favorable prior, and its risk is the minimax risk. Thus, the 
Bayes risk of any prior is a lower bound for the minimax risk, which suggests an obvious variational approach to 
bounding the minimax risk by choosing a prior whose risk is high but easy to calculate. Obviously, some priors have 
very low risk [e.g., p(p) = 5(p — po)], and provide useless lower bounds. A common approach is to use the uniform 
(Lebesgue) prior, but for the noisy coin this prior actually has rather low [0(l/iV)] risk. So instead, we consider the 
set of bimodal priors, 


r 0) = 


8(P ~ Po) + $(P - Pi) 


( 10 ) 


(Varying the weights yields a slightly less favorable prior, but doesn’t change the asymptotic scaling). We will choose 
Po = 0 and pi ~ 1/yfN. 

The risk of 7r is given by d(n) = [d(0) + d(p±)\/2, and by observing that d(p\) > 0 we obtain the lower bound 


d(n) > ^d( 0) = ^E„| p=0 [£>(0||£(n))], 


where the Bayes estimator is the posterior mean, given by 

pi Pr(n|pr) 


p(n) = 


Pi 


in terms of the likelihood ratio 


Pr(n|pi) + Pr(n|0) 1 + A(n) ’ 

Pr(n|0) 

A (n) = —-—:—-. 

Pr(n|pi) 

We can lower-bound the relative entropy term by U(0||p) = — log(l — p) > p, so 

d( 0) > E„| p =o[p(n)] = PiE n |p=o Y+Jfnj 

If we define A (n) = —2 log A (n) and apply Jensen’s inequality, we obtain 

d(0) >p ie -5 E "^=°[ A(ri)1 . 


( 11 ) 


( 12 ) 


(13) 


(14) 


(15) 


Next, we perform a Taylor expansion of the expectation E n | p=0 [A(n)] around pi = 0. The derivatives of the likelihood 
function (8) are 


logPr(n|p!) = 


d 

dpi 

q2 _^ 

^logPr^lPi) = E (- 


n k (l — 2a k ) (1 - n k )(l - 2a k ) 


q k 

n k ( 1 - 2a k ) 2 


1 - q k 

(1 - n k )( 1 - 2a k ) 2 
(1 ~ Qk) 2 


(16) 

(17) 
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Evaluating these at pi = 0 and taking the expectation E„| p=0 [rik} = ak, we have 

* d 

Etl| p =o ~k— logPr(n|pi) =0, 

-P 1 Jpi=0 


E, 


•n\p=o 


— logPr(n|pi) 

ydpi 

d 2 , p / | \1 V " (1 ~ 2 a /;) 2 


Putting everything together in the Taylor series, we obtain 


E„ b=0 [A(n)]-53 in _ 


- O(v^). 


(18) 

(19) 


( 20 ) 


where the 0(pf) term does not scale with N w/r.t. the leading order term. 
To simplify this quantity, we define the per-sample “resolution” /?*., 


Pk = 


(1 - 2 a k ) 2 

(1 &k) 


(which, at least for small au , is approximately l/a*,). The expectation value in Eq. 20 can be written concisely in 
terms of the average /3, ft = N~ l J2k= l Pk: as 

E n |p=o [A(n)] ~ Npp\. 


Finally, we set 

1 1 

" 7 


( 21 ) 


which ensures that p\ —> 0 as N —» oo and justifies truncating the series expansion Eq. 20 above at leading order. 
This yields a lower bound on the minimax risk of 


dmax 2^ d{i r) 


1 1 

2y/epVN' 


( 22 ) 


It is worth noting that the risk is not determined by the average value of a (the per-sample noise probability), but by 
the average of /?, which behaves roughly like 1/a. In particular, if any constant fraction of the samples are observed 
noiselessly, then those samples have p = oo, and they dominate the minimax risk - p —► oo, and the minimax risk 
collapses to 0(1/N), as is appropriate for a noiseless coin. 


Minimax risk for quantum tomography 

In this section, we derive a lower bound for the minimax risk of qubit state estimation using the same framework 
that we used for the noisy coin. The difficulty in doing this is that a qubit’s state space (the Bloch sphere) is more 
complex than that of a coin - instead of a single parameter ( p) there are three ( x,y,z ). However, the minimax risk 
is dominated by (i) states p that are very close to pure, and (ii) errors in estimating the spectrum of p (rather than 
errors in its eigenvectors, which contribute much less to the risk). This observation allows us to simplify the analysis 
greatly by choosing a bimodal prior for the qubit, supported on two states that differ only in their eigenvalues. In this 
circumstance, each measurement provides information equivalent (in its effect on the final estimate) to a noisy coin 
flip whose noisyness depends on what was measured (or, most generally, on which outcome was observed). Because 
we have chosen a very simple prior that is not least favorable, our analysis only guarantees a lower bound. However, 
it captures the dominant component of the minimax risk, and (for such a simple model) turns out to be surprisingly 
close to tight. 

Suppose we are given N samples of a qubit state p. The state is drawn from a bimodal prior supported on (i) a 
pure state po = \ip)(ip\, and (ii) a slightly more mixed state p\ = (1 — p\) \^){^\ +pi(ll — |^)(V>|): 

Ap) = \ (HP ~ Po) + S(p - pi )). 


(23) 










We will specify | ip) and p\ ss 1/y/N later. Each sample is measured in some basis; on the fcth sample we perform the 
[orthogonal basis] POVM {\(j>k){(t>k \, 1 — and list the outcomes as a binary vector n := {n k }- 

The likelihood function for a single observation is 


Pr(nfc|po) = I (</>*#) I 2 = : a k 

Pr(n fe |Pi) = (1 - 2pi)| {4>k\4>) | 2 +Pi = (1 - 2a fc )pi +a k - 
These are identical to the likelihoods for the noisy coin. The Bayes estimator is 

= [Pr(n|p 0 ) + (1 - 2pi)Pr(n|pi)] \i/j)(i/j\ + Pl ) Pr^pi)!! 

Pi'MPo) + Pr(n| / oi) 

Now, to compute the expected risk, we observe that the Bayes estimate is always of the form p = a |'0)('0| + /3J, with 

[Pr(n|p 0 ) + (1 - 2pi)Pr(n| ( oi)] piPr(n|pi) 


(24) 

(25) 


(26) 


a = 


L , P = 


Pr(n|p 0 ) +Pr(n|pi) ' r Pi'MPo) + Pr( n ki) ’ 

and for any such mixture a = a IkXX’l + fll, the relative entropy can be computed as 

d (kXX’l Ik) = ^ (X’l logcr \ip) 

= - (X’l log(a + /?) \m\ + log/3(l - IV’XV’I) IX’) 

= -log(a + /3). 


Thus, in the limit of p\ 


0 and N —> oo, the risk given that p = po is given by 

’Pr(n|po) + Pr(™|pi) (1 - pi) 


D(po\K n )) = - log 
= - log 


= P l 


Pr(n|p 0 ) + Pr(n|pi) 
pi Pr(w|pi) 
Pr(n|p 0 ) +Pr(n|pi)_ 

Pr(n|f,l) o«). 


1 - 


Pr (ri|po) + Pr(w|pi) 


(27) 


(28) 

(29) 

(30) 


(31) 

(32) 

(33) 


This is identical to the risk of the noisy coin. 
As in Eq. 21, we choose 


Pi = 


1 1 


(34) 


where /3 is defined in Equation 7 as 


This yields a near-final lower bound of 


N N 

= N^ Pk= N^ 


k =1 


(1 - 2 Qfc ) 2 

N k^l ak ^ “ a k" 


dmax > r(7r) > 


e 2 l 

2 t/pVn' 


(35) 


To obtain a concrete lower bound on the risk, we must choose IX’XV’I- To ensure that the average resolution [3 is as 
small as possible near \ip), we want a k to be uniformly large. For the case of a qubit or a rebit, the solution is to pick 

the state “furthest away” from all the measurement axes, which yields a k = \ ^1 — ks) w ^ ere D = 2 for a rebit and 
D = 3 for a qubit. This yields the simple result = 4/(Zi> — 1), and therefore 


dmax > r(ir) > 


e 2 k-D — 1 
4 y/N 


(36) 


This argument can be extended to any discrete POVM, by choosing | ip) so that it is not orthogonal to any effect of 
the POVM. Then a k for each k will be lower-bounded by the minimum overlap of | ip) with any effect, and /3 will be 
finite, and so the minimax risk will scale as 1/vXV. 
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But, by exactly the same argument, the best nonadaptive tomographic measurement must be unitarily symmetric. 
And since it must also be rank-1, it is the Haar-uniform POVM whose effects include all pure states \4>)(<j>\ with the 
unitarily invariant measure. The analysis given so far breaks down for this uniform POVM, because no matter what 
| ip) we choose, the measurement has effects that diagonalize it. The effective “noise” 


Pr(n fc |p 0 ) = I (</*#) I 2 = : a k 


is distributed uniformly over [0,1]. If we attempt to replace the sum 


y (1 ~ 2 a k ) 2 

a k(l ~ Oi k ) 




(37) 


(38) 


with its average, then the integral diverges and our lower bound collapses to d > 0. 

Instead, we observe that the Haar-uniform POVM can be described as a two-step process: (1) choose a Haar 
uniform-basis, and (2) measure in that basis. So, a tomography experiment involving N samples can be described by 
a sequence [aq, ct 2 , • ■. ajv], in which each ak is drawn from the uniform distribution over [0,1]. The minimax risk is 
the [probability-weighted] average over all such sequences. We divide them into two subsets: those in which all the 
{at} lie in the interval 


oik G 


1 

2iV’ 1 


1 

2 N 


and those in which at least one does not. 

The probability that any given ak lies outside the interval is exactly 1/1V, so all of them lie within it with probability 


P = 



1 

e 


Conditional on all the {aq.} lying within the interval, the minimax risk can be lower bounded by integrating /3 over 
the interval, which yields 


(3 = 2 log N + 0(1). 


(39) 


This happens with probability at least 1/e, so a lower bound on the minimax risk for the Haar-uniform POVM (as 
N —>• oo) is 


d > 


1 1 

(2e) 3 /2 yWIogW' 


(40) 


Least favorable priors 

The “Optimization Toolbox” in MATLAB 2011a, for example, contains a method fminimax which directly solves 
the optimization problem we are interested in. However, finding minimax estimators by brute force seems impossibly 
difficult. There are uncountably many estimators; each one is a density-matrix valued function on the set of all 
possible datasets. Each estimator’s performance is quantified by maximizing its risk profile d(p) over all density 
matrices p. Even computing the maximum risk of a single specified estimator is nontrivial; finding its minimum over 
the uncountable set of all estimators seems intractable. 

Fortunately, we have some useful mathematical tools that simplify matters greatly (see, for example, [8]): 

1. The minimax estimator is also the Bayes estimator for some measure. This fact is called Minimax-Bayes duality , 
and the measure in question is called a least favorable prior (LFP). 

2. Relative entropy is a Bregman divergence (a.k.a. strictly proper scoring rule), and therefore the Bayes estimator 
for any given measure p is Bayesian mean estimation (BME). 

3. The least favorable priors for this problem are (empirically) always discrete , with a finite number of support 
points. This is not proven, but it is often the case in similar problems, and is easy to verify numerically for this 
problem. 
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Minimax-Bayes duality is enormously helpful, both as a technical tool and as an aid to problem-solving. The reasoning 
behind this duality is fairly straightforward: 

1. Any estimator involves trade-offs in accuracy, which are quantified by its risk profile d{p). For example, the 
constant estimator /5(D) = po is exceptionally accurate if the true state happens to be pol That is, d(po) = 0. 
No other estimator can match its accuracy at po■ But there is a price to be paid; d(p) is dreadfully high for any 
state p that is far from pq. 

2. Averaging d(p) over a measure p quantifies these tradeoffs. In order to minimize that average, the Bayes 
estimator for p must achieve fairly low expected risk in regions where p is concentrated, but can tolerate high 
risk where p is sparse. 

3. If we consider the Bayes estimator for a specific measure po, its risk profile d(p) will typically be non-constant 
- so it will have at least one maximum, which we denote p$. Now suppose that we modify po (to p') by slightly 
increasing the probability density around po■ The new measure p' will have a higher Bayes risk (since po has 
higher-tlian-average risk, and is now slightly more probable). But the Bayes estimator for p' will be slightly 
different as well; it will achieve a lower value of d(po) because by increasing the probability of po we have 
increased the value of achieving low risk at po- 

4. Iterating this process defines a flow - probability flows towards high-risk states (decreasing their expected risk) 
and away from low-risk states (increasing their expected risk). Every step in this flow defines a new prior (and 
its associated Bayes estimator) with higher Bayes risk and lower maximum risk. 

5. If p is a stationary point of this flow, then the risk profile of its Bayes estimator p M (D) is: (i) equal to a constant 
C on the support of p, and (ii) no greater than C at every point not in the support of p. This estimator is 
necessarily minimax, because: 

• No estimator can achieve lower average risk on p (by the definition of Bayes estimator), 

• So no estimator can achieve lower maximum risk on the support of p (since p M (D)’s risk is constant), 

• And therefore no estimator can achieve lower maximum risk over all states (since “all states” is a superset 
of p’s support). 

6. Such a stationary measure can occur in one of two ways. Either p^(D) has constant risk on all states, or p is 
supported on a discrete set. (Because d(p) is analytic, it cannot be constant over a limited range, which means 
either it is constant everywhere or it has discrete maxima). 


Numerical recipes 

The above argument is the basis for the numerical algorithm we used to compute LFPs, and hence the minimax 
estimators. Our results were generated using an implementation of Algorithm 1, a variant of the one given by 
Kempthorne [19]. Although this algorithm is drastically more efficient than the brute force approach, it is still 
insufficient to extract the asymptotic form of the scaling for the risk of minimax estimator. In particular, it takes 
~week to compute the LFP shown in Figure 4 using an implementation of Algorithm 1 in MATLAB 2011a on four 
2.2GHz processors. 

To remedy this, we devised an efficient Monte Carlo algorithm (Algorithm 2). The core insight is that varying only 
the weights of the prior renders maximizing the Bayes risk a convex optimization problem. The algorithm proceeds 
by randomly choosing n states according to the Hilbert-Schmidt prior. Then, the Bayes risk is maximized keeping 
the location of the states fixed. Both upper and lower bounds on the minimax risk can be obtained. If these are not 
close, then we resample near those points whose weights have not be set to zero and repeat the process. 

Least favorable priors produced by Algorithm 2 are noticeably different from the (more) exact solutions obtained 
by Algorithm 1 (Fig. 4). However, the corresponding Bayes estimators are nearly identical, and these LFPs yield 
very tight upper and lower bounds on d max (see Figure 5). We conclude that the minimax risk is very insensitive 
to certain variations in the prior. This explains the discrepancies in the LFPs obtained via Algorithms 1-2, and also 
justifies our use of the estimators and risks obtained via Algorithm 2. Using this algorithm, we were able to find 
good approximations to the minimax risk up to N = 192, but this is still insufficient to clearly show the asymptotic 
behavior of d max . For that purpose, we developed the “noisy coin” model. 
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FIG. 4: Here we show the support points of numerical approximations to least favorable priors (LFPs) for IV = 16 (M = 8) 
Pauli measurements on a rebit (a qubit with the constraint ( a y ) = 0. The weights on these points are not uniform, but we 
shown a Gaussian kernel density estimate of them on the right. The LFP found using the highly accurate Algorithm 1 is 
supported on the large black dots, while the one found using the much faster Algorithm 2 is supported on the smaller gray dots. 
Note that in this case (and all others where we could use Algorithm 1), while the LFPs are evidently different, the resulting 
minimax risks are indistinguishable. We conclude that the maximum risk is insensitive to certain visible variations in the prior. 


# Deterministic 

• Monte Carlo 



FIG. 5: Comparison of the minimax risk computed using Algorithms 1 and 2. The minimax risk of qubit and rebit tomography 
with N Pauli measurements (N = 3... 192 for qubits and N = 2... 512 for rebits) was computed by finding least favorable 
priors, using Algorithm 1 (large black dots) and Algorithm 2 (small gray dots). In all cases where both algorithms could be 
applied, results agreed to high precision. 
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Algorithm 1 Kempthorne (deterministic) algorithm for finding the least favorable prior [19]. 
Input: Number of measurements IV > 0. 

Input: Support points of initial guess prior ag, i £ {1,..., n}. 

Input: Probability weights of the support points ng, i £ {1,..., n} such that JA w% — 1. 

Input: Tolerance tol > 0. 

Input: Mixing parameter a. 

Output: Least favorable prior {agin} with m> n support points. 

Output: Lower bound on the minimax risk av_risk. 

Output: Upper bound on the minimax risk max_risk. 
function DeterministicLFP( 1V, {agin},tol) 

diff 4 — tol + 1 
while diff > tol do 

{a;, in} 4— prior with same number of support points which maximizes the Bayes risk 
av_risk 4— the maximum value of the Bayes risk for the prior found above 
maxjrisk 4— global maximum of risk using the Bayes estimator of the ({agin}) 
diff 4— |av_risk — max_risk|/av_risk 
if diff > tol then 

Add a new support where the maximum risk is attained 

V-lcngth( :r.) ^ Cl 

for each i < length(a:) — 1, «g 4 — «g — a/(length(a:) — 1) 

end if 
end while 

return {ag in}, av_risk, max_risk 
end function 


Algorithm 2 Monte Carlo algorithm for finding the least favorable prior. 
Input: Number of measurements N > 0. 

Input: Number of support points n > 0. 

Input: Tolerance on accuracy tol > 0. 

Input: Tolerance on the weights to remove supports weight_tol > 0. 

Input: Number of support points to add at each iteration m > 0 for each current support point. 
Input: Variance of normal distribution to sample new points from a. 

Output: Least favorable prior {agin} with m> n support points. 

Output: Lower bound on the minimax risk av_risk. 

Output: Upper bound on the minimax risk max_risk. 
function MCLFP(A, n,tol,weight_tol) 
diff 4 — tol + 1 

{agin} 4 — uniform distribution («g = 1/n) sampled according to uniform distribution over x 

while diff > tol do 

in weights which maximize the Bayes risk keeping the support points x fixed 
av_risk 4- the maximum value of the Bayes risk for the prior found above 
max_risk 4— global maximum of risk using the Bayes estimator of the ({agin}) 
diff 4 — jav_risk — max_risk|/av_risk 
Remove all ag such that < weight_tol 
if diff > tol then 
for each ag left do 

Add m new support sampled randomly from A/"(ag, a) 

end for 

each Wi 4 — l/length(a;) 

end if 
end while 

return {ag in}, av_risk, max_risk 
end function 








